Regex support in Python→Laurel translation#623
Conversation
Add re.compile, re.match, re.search, re.fullmatch flowing through Laurel to Core's SMT-LIB string theory. Laurel prelude: from_regex constructor on Any, re_Match composite type with uninterpreted method stubs. Core prelude: re_to_regex normalizer (str|Pattern→regex), re_fullmatch_bool/re_match_bool/re_search_bool with correct anchoring, re_compile, re_fullmatch, re_match, re_search with disjunctive requires. PyFactory: re_compile_str with concreteEval calling pythonRegexToCore. ReToCore: fix latent bug — add LMonoTy type annotations to all regex expression builders so concreteEval output survives the SMT encoder (which rejects unannotated ops). The old pipeline never triggered this because the old Python→Core translator panics on re.match and DiffTestCore round-trips through text which re-annotates everything. LaurelToCoreTranslator: map Laurel UserDefined "regex" to Core regex sort. StrataMain: pass ReFactory to Laurel pipeline's verifyCore. Test: end-to-end test_regex.py covering fullmatch/match/search with literal and compiled patterns, character classes, quantifiers, alternation, dot, and anchoring differences between modes.
| concreteEval := some | ||
| (fun _ args => match args with | ||
| | [LExpr.strConst () s] => | ||
| let (expr, maybe_err) := pythonRegexToCore s .fullmatch |
There was a problem hiding this comment.
This function (also the pre-existing reCompileFunc) has a problem in that it hard-codes .fullmatch (pythonRegexToCore generates different outputs based on the mode). Making this parametric in the mode won't suffice: we'd need to look ahead in the program to see how the output of re.compile is used and provide that appropriate mode here. Too messy.
I think we need a different solution where Python's re.compile is essentially a no-op semantically. However, Python's re.search, re.match, and re.fullmatch can be implemented using such factory functions, with hard-coded match modes.
Also, could you tell me why reCompileStrFunc doesn't return an exception (unlike reCompileFunc)? pythonRegexToCore does model exceptions -- Python-analog patternError and Strata's own unimplemented where we can return a havoced regex. It'd be a shame to lose this info. at the Laurel level.
There was a problem hiding this comment.
I'm understanding that the exact bug before your commits was not just that the regex compiler was always being called in fullmatch mode, but also there's no way for the prelude regex functions to be written to cope with patterns that themselves use anchors when the regex compiler is uniformly called that way. Egg on face for letting that slip by, etc.
And the reason you want re.compile to be identity on the nose is b/c you are no longer distinguishing between strings and regexes? I think that's fine for now, but would probably need to be changed once we start wanting to support more of the regex standard library. In particular, when we want to start supporting the instance methods on patterns.
Thanks for the exception handling and updating the ReToCore tests. Do you mind quickly explaining the necessity of pythonRegexToCoreEraseTypes?
Replace single re_compile_str (hardcoded .fullmatch mode) with three mode-specific factory functions: re_fullmatch_str, re_match_str, re_search_str. Each compiles the pattern with the correct MatchMode so anchors (^/$) are handled properly. re.compile is now a semantic no-op (returns the pattern string unchanged). The match mode is applied at the point where matching happens, not at compile time. This fixes incorrect behavior for patterns with explicit anchors used across different match modes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add re_pattern_error factory function that surfaces genuinely malformed patterns as RePatternError, modeling Python's re.error. Unimplemented features return NoError (Python would succeed; sound over-approximation comes from mode-specific factory staying uninterpreted). - Prelude checks re_pattern_error first and returns exception(err) for pattern errors before attempting the match. - Add RePatternError variant to the Laurel Error datatype. - Remove dead reCompileFunc (PyReCompile) — never wired into either pipeline. - Replace old candidate-translation comment with architecture overview documenting the double-parse methodology. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add ~130 new test assertions covering anchors across all modes, escaped metacharacters, colon patterns, multi-char anchor patterns, anchors inside alternation, compile+anchor interaction, and more. - Add 9 expected-failure tests (currently unknown due to Laurel pipeline SAT-finding limitation) documenting properties the solver should disprove but cannot. - Remove from_regex constructor from Any datatype — unused since re.compile is now a no-op returning the pattern string. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace manual .tcons construction with mty[regex], mty[string → regex], etc. for readability. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add type annotations (mty[Error], mty[string → Error]) to
RePatternError/NoError constructors in rePatternErrorFunc concreteEval,
fixing SMT encoding errors for malformed patterns.
- Add malformed pattern tests in two styles:
- m != None (passes): proves exception value is produced
- try/except (unknown): idiomatic Python pattern; VC simplifies to
(assert true) after concreteEval but quantified prelude axioms make
SAT-finding intractable for the solver.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MikaelMayer
left a comment
There was a problem hiding this comment.
One minor test update.
StrataTest/Languages/Python/expected_laurel/test_regex.expected
Outdated
Show resolved
Hide resolved
…regex # Conflicts: # StrataTest/Languages/Python/run_py_analyze_sarif.py
… feat-laurel-regex # Conflicts: # StrataTest/Languages/Python/run_py_analyze_sarif.py
72f7902 to
98d20a1
Compare
- Handle Init.boolFalse as none in DDM option parser (fixes parsing of pyspec ion files generated with Bool default values) - containsKey(kwargs, key) translates to parameter presence check (!Any..isfrom_none(param)) instead of Dict lookup - Generate required-parameter assertions for kwargs fields without defaults - Disable regexMatch until PR #623 lands (emit assert true)
The regex forward declarations in the DDM prelude are before CoreOnlyDelimiter and thus stripped by coreOnlyFromRuntimeCorePart. Pass ReFactory to Core.typeCheck so the regex function signatures are available during type checking, matching what pyAnalyzeLaurel does at verification time.
# Conflicts: # StrataTest/Languages/Python/run_py_analyze_sarif.py
1e9572c to
7991b22
Compare
Define PythonFactory (Core.Factory + ReFactory) and use it in pyAnalyzeToGotoCommand and pyAnalyzeLaurelToGotoCommand. Add a factory parameter to typeCheckCore and inlineCoreToGotoFiles (defaulting to Core.Factory) so non-Python callers are unaffected.
MikaelMayer
left a comment
There was a problem hiding this comment.
Well-structured PR. The architecture — deferring regex compilation to match time so the correct MatchMode is applied — is a clean design that avoids the re.compile mode-ambiguity problem. The type-annotation fix in ReToCore.lean is well-motivated and the abbreviations (reTy, s2r, etc.) keep the builders readable. The pythonRegexToCoreEraseTypes test helper is a nice way to keep existing #guard_msgs output stable. Test coverage is thorough.
A few observations below.
Add regexType DDM op to LaurelGrammar.st, map it to .TCore "regex" in ConcreteToAbstractTreeTranslator, and remove the special case in LaurelToCoreTranslator. Add comment in PyFactory on irrelevant mode in rePatternErrorFunc.
Description of changes:
Add re.compile, re.match, re.search, re.fullmatch for Python->Laurel
Test: end-to-end test_regex.py covering fullmatch/match/search with literal and compiled patterns, character classes, quantifiers, alternation, dot, and anchoring differences between modes.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.